Connecting SOC with RL – Importance sampling

Alonso Cisneros

Freie Universität Berlin

Optimal biasing, the small leap: Stochastic Optimal Control · Uncertainty Quantification · Reinforcement Learning · Importance Sampling

Outline

  1. Crash course on RL

  2. What is importance sampling

     • The connection to optimization
     • Optimal biasing

  3. Optimal biasing as an RL problem

  4. SOC ≠ RL

  5. The things I’d like to connect

Crash course on Reinforcement Learning

A miniopoly board
  • The game has a state at turn \(t\) denoted \(s_t\)
  • At a turn \(t\) players roll the dice
  • The change in money after buying/paying rent/charging rent is recorded as a reward \(r_t\)

Important

We train our robot to maximize rewards as it takes actions while exploring the space of states

Can you describe the state space?

What does it look like?

It’s hard to describe the state space, but we can study the dynamics

What if we don’t know how square transitions work?

  • We calculated the transition probabilities using our knowledge of the dice

5 minutes to think

  • What is a reasonable way of guessing transition probabilities?
  • Can I be sure to observe even improbable but still possible states?

The right answers are:

  • By simulating the transitions and collecting empirical estimates
  • No
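As a concrete illustration (a sketch with hypothetical names; the slides themselves show no code), estimating the two-dice transition distribution empirically and comparing against the exact values could look like:

```python
import random
from collections import Counter

def empirical_transition_probs(n_rolls=100_000, seed=0):
    """Estimate the distribution of a two-dice sum by simulating rolls."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) + rng.randint(1, 6)
                     for _ in range(n_rolls))
    return {total: counts[total] / n_rolls for total in range(2, 13)}

probs = empirical_transition_probs()
# The exact probability of rolling a 7 is 6/36, and the empirical
# estimate approaches it as n_rolls grows.
```

Note the second point from the slide: with finitely many rolls, an improbable sum (say snake eyes many times in a row) may simply never appear in the data.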

Markov Chain Monte Carlo

  • We let the robot roam around and buy squares as it pleases
    • For any square, it can either buy it or not

Importance Sampling

  • We wanted to compute the robot’s expected reward over the entire game
  • Not every problem is this well behaved: some systems stay trapped in one region of state space for very long times
  • This property is called metastability
  • Importance sampling aims to remedy this

Important

The general idea of importance sampling is to draw random variables from another probability measure and then weight them back, so that we still have an unbiased estimator of the desired quantity of interest.

  • We want to explore the state space
    • We can’t make it past the barrier to the other well
  • The jumps are exponentially less probable w.r.t. the height of the barrier
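To see metastability concretely, here is a small sketch (assuming a standard double-well \(V(x) = (x^2 - 1)^2\), which the slides do not specify) that counts well-to-well transitions at two noise levels:

```python
import numpy as np

def simulate_langevin(sigma, n_steps=50_000, dt=1e-3, x0=-1.0, seed=1):
    """Euler-Maruyama for dX = -V'(X) dt + sigma dW with the (assumed)
    double-well V(x) = (x^2 - 1)^2; returns the number of well-to-well
    transitions observed along one trajectory."""
    rng = np.random.default_rng(seed)
    grad_V = lambda x: 4.0 * x * (x**2 - 1.0)
    x, side, crossings = x0, -1, 0          # start in the left well
    for _ in range(n_steps):
        x += -grad_V(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        if x * side < 0 and abs(x) > 0.5:   # settled in the other well
            crossings += 1
            side = -side
    return crossings
```

At small \(\sigma\) the trajectory hardly ever leaves the starting well, while at large \(\sigma\) it hops freely, matching the exponential dependence on the barrier height.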

More formally…

  • The metastable stochastic system we sample from follows overdamped Langevin dynamics \[\begin{equation} \mathrm{d}X_s = - \nabla V(X_s) \, \mathrm{d}s + \sigma(X_s) \, \mathrm{d}W_s \end{equation}\]
  • We want to hit a target set \(\mathcal{T}\). We define \[\begin{equation} \tau = \inf \{ s > 0 \mid X_s \in \mathcal{T} \} \end{equation}\]
  • We’re interested in computing \(I: C([0, \infty), \mathbb{R}^d) \to \mathbb{R}\) \[\begin{equation} I(X) \coloneqq \exp(- \mathcal{W}(X)), \end{equation}\] where the work functional \(\mathcal{W}(X) \coloneqq \int_0^{\tau} f(X_s) \, \mathrm{d}s + g(X_\tau)\) accumulates a running cost \(f\) and a terminal cost \(g\)

Our main goal is to compute \[ \Psi(x) \coloneqq \mathbb{E}^x [I(X)] \coloneqq \mathbb{E}[I(X) \mid X_0 = x] \]

But…

  • Naive MCMC estimates converge terribly because of metastability
  • No closed form for \(\Psi\) exists
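A crude Monte Carlo baseline makes the problem tangible. The sketch below again assumes an illustrative double-well with \(f \equiv 1\) and \(g \equiv 0\), so \(\mathcal{W}(X) = \tau\) and \(\Psi(x) = \mathbb{E}^x[e^{-\tau}]\); none of these choices come from the slides:

```python
import numpy as np

def naive_mc_psi(x0=-1.0, target=1.0, sigma=1.0, dt=1e-3,
                 n_paths=500, max_steps=20_000, seed=2):
    """Naive Monte Carlo for Psi(x) = E^x[exp(-tau)], with tau the hitting
    time of T = [target, inf); f = 1, g = 0, so W(X) = tau."""
    rng = np.random.default_rng(seed)
    grad_V = lambda x: 4.0 * x * (x**2 - 1.0)   # V(x) = (x^2 - 1)^2
    x = np.full(n_paths, x0)
    tau = np.full(n_paths, max_steps * dt)      # paths that never hit keep t_max
    alive = np.ones(n_paths, dtype=bool)
    for step in range(max_steps):
        hit = alive & (x >= target)
        tau[hit] = step * dt
        alive &= ~hit
        if not alive.any():
            break
        noise = rng.standard_normal(alive.sum())
        x[alive] += -grad_V(x[alive]) * dt + sigma * np.sqrt(dt) * noise
    weights = np.exp(-tau)
    return float(weights.mean()), float(weights.std())
```

Because crossings are rare, most of the paths contribute nearly nothing and the relative error of the estimator is poor: this is exactly what importance sampling fixes.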

Tip

  • We can “push” the particle by adding a force, as long as we account for it and correct for that bias
  • That “push” is achieved by adding a control \(u\).

The new, controlled dynamics are now described as \[\begin{equation} \label{eq: controlled langevin sde} \mathrm dX_s^u = (-\nabla V(X_s^u) + \sigma(X_s^u) \, u(X_s^u))\mathrm ds + \sigma(X_s^u) \mathrm dW_s, \qquad X_0^u = x \end{equation}\]

Via Girsanov’s theorem, we can relate our QoI to the original expectation as follows: \[\begin{equation} \label{eq: expectation IS} \mathbb{E}^x\left[I(X)\right] = \mathbb{E}^x\left[I(X^u) M^u\right], \end{equation}\]

where the exponential martingale \[\begin{equation} \label{eq: girsanov martingale} M^u \coloneqq \exp{\left(- \int_0^{\tau^u} u(X_s^u) \cdot \mathrm dW_s - \frac{1}{2} \int_0^{\tau^u} |u(X_s^u)|^2 \mathrm ds \right)} \end{equation}\] corrects for the bias the pushing introduces.
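A minimal sketch of this reweighting, using a constant control \(u\) that pushes toward the target (the constant choice, the double-well \(V(x) = (x^2-1)^2\), and \(\mathcal{W}(X) = \tau\) are all illustrative assumptions) and the discretized Girsanov weight:

```python
import numpy as np

def is_mc_psi(u=1.0, x0=-1.0, target=1.0, sigma=1.0, dt=1e-3,
              n_paths=500, max_steps=20_000, seed=3):
    """Importance-sampling estimate of Psi(x) = E^x[exp(-tau)] with a
    constant push u toward the target; the discretized Girsanov weight
    M = exp(-sum u dW - 0.5 sum |u|^2 dt) corrects for the bias."""
    rng = np.random.default_rng(seed)
    grad_V = lambda x: 4.0 * x * (x**2 - 1.0)
    x = np.full(n_paths, x0)
    tau = np.full(n_paths, max_steps * dt)     # paths that never hit keep t_max
    log_M = np.zeros(n_paths)
    alive = np.ones(n_paths, dtype=bool)
    for step in range(max_steps):
        hit = alive & (x >= target)
        tau[hit] = step * dt
        alive &= ~hit
        if not alive.any():
            break
        dW = np.sqrt(dt) * rng.standard_normal(alive.sum())
        # controlled drift: -grad V + sigma * u
        x[alive] += (-grad_V(x[alive]) + sigma * u) * dt + sigma * dW
        log_M[alive] += -u * dW - 0.5 * u**2 * dt
    weights = np.exp(-tau) * np.exp(log_M)     # I(X^u) * M^u
    return float(weights.mean()), float(weights.std())
```

The pushed paths reach the target quickly, so far fewer samples are wasted; the weight \(M^u\) keeps the estimator unbiased.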

Important

The previous relationship always holds. But the variance of the estimator depends heavily on the choice of \(u\).

Clearly, we aim to achieve the smallest possible variance through an optimal control \(u^*\) \[\begin{equation} \label{eq: variance minimization} \operatorname{Var} \left( I(X^{u^*}) M^{u^*} \right) = \inf_{u \in \mathcal{U}} \left\{ \operatorname{Var} (I(X^u) M^u) \right\} \end{equation}\]

Connection to optimization

It turns out¹ that the problem of minimizing the variance corresponds to a problem in stochastic optimal control

The cost functional \(J\) to find the variance minimizing control is \[\begin{equation} \label{eq: cost functional} J(u; x) \coloneqq \mathbb{E}^x\left[\mathcal{W}(X^u) + \frac{1}{2} \int_0^{\tau^u} |u(X_s^u)|^2 \mathrm ds \right], \end{equation}\]

So that the value function \(\Phi \coloneqq -\log \Psi\) satisfies \[\begin{equation} \Phi(x) = \inf_{u \in \mathcal{U}} J(u; x). \end{equation}\]

Important

The optimal bias achieves zero variance: \[\begin{equation} \operatorname{Var} \left( I(X^{u^*}) M^{u^*} \right) = 0. \end{equation}\]
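A sketch of why (standard in the literature, details hedged): with the value function \(\Phi = -\log \Psi\) and \(u^* = -\sigma^\top \nabla \Phi\), applying Itô’s formula to \(\Phi(X_s^{u^*})\) along the optimally controlled path shows that the weighted payoff is deterministic:

\[\begin{aligned} I(X^{u^*})\, M^{u^*} &= \exp\left(- \mathcal{W}(X^{u^*}) - \int_0^{\tau^{u^*}} u^*(X_s^{u^*}) \cdot \mathrm{d}W_s - \frac{1}{2} \int_0^{\tau^{u^*}} |u^*(X_s^{u^*})|^2 \, \mathrm{d}s \right) \\ &= \exp(-\Phi(x)) = \Psi(x) \quad \text{a.s.} \end{aligned}\]

Every realization of the estimator equals the quantity being estimated, hence zero variance.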

Optimal biasing through RL

  • Let’s reconsider the SOC problem (excuse the change in notation)
  • We discretize with Euler–Maruyama \[\begin{align*} s_{t+1} &= s_t + \left( -\nabla V(s_t) + \sigma u(s_t)\right) \Delta t + \sigma \, \sqrt{\Delta t} \, \eta_{t+1} \\ s_0 &= x \end{align*}\]

The time-discretized objective function is given by \[\begin{equation} J(u; x) \coloneqq \mathbb{E}^{x} \left[ g(s_{T_u}) + \sum_{t=0}^{T_u - 1} f(s_t) \Delta t + \frac{1}{2} \sum_{t=0}^{T_u - 1} |u(s_t)|^2 \Delta t \right] \end{equation}\]

  • Our stopping time \(\tau\) is now denoted \(T_u\)

Some formalities

  • The state space \(\mathcal{S}\) is all possible \(s \in \mathbb{R}^d\)
  • The action space \(\mathcal{A}\) is the codomain of the controls, \(\mathbb{R}^d\)
  • The stopping time \(T_u\) for the controlled process is a.s. finite
  • We’ll approximate the control with Galerkin projections \(u_\theta\)
  • We still need to derive the transition probabilities and the reward function
  • The return we want to optimize depends on a reward function \[ r_t = r(s_t, a_t) \coloneqq \begin{cases} - f(s_t) \Delta t - \frac{1}{2} |a_t|^2 \Delta t & \text{if} \; s_t \notin \mathcal{T} \\ -g(s_t) & \text{if} \; s_t \in \mathcal{T}. \end{cases} \]
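Putting these pieces together, a minimal environment sketch in the usual gym style (the class name `LangevinEnv`, the double-well \(V(x) = (x^2 - 1)^2\), and \(f \equiv 1\), \(g \equiv 0\) are all illustrative assumptions, not from the slides):

```python
import numpy as np

class LangevinEnv:
    """Sketch of an RL environment for the discretized controlled Langevin
    dynamics, with the reward function defined above (f = 1, g = 0)."""
    def __init__(self, sigma=1.0, dt=1e-3, x0=-1.0, target=1.0, seed=0):
        self.sigma, self.dt, self.x0, self.target = sigma, dt, x0, target
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.s = self.x0
        return self.s

    def step(self, a):
        done = self.s >= self.target          # s in the target set T
        if done:
            return self.s, 0.0, done          # terminal reward -g(s) = 0
        reward = -1.0 * self.dt - 0.5 * a**2 * self.dt   # -f dt - |a|^2 dt / 2
        grad_V = 4.0 * self.s * (self.s**2 - 1.0)
        noise = self.rng.standard_normal()
        self.s += (-grad_V + self.sigma * a) * self.dt \
                  + self.sigma * np.sqrt(self.dt) * noise
        return self.s, reward, done
```

Maximizing the return under this reward is, up to sign, the same as minimizing the time-discretized cost \(J(u; x)\) above, which is what lets an RL agent learn the variance-minimizing bias.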

Future work

References

Quer, J., and Enric Ribera Borrell. 2024. “Connecting Stochastic Optimal Control and Reinforcement Learning.” Journal of Mathematical Physics 65. https://doi.org/10.1063/5.0140665.
Ribera Borrell, Enric, Jannes Quer, Lorenz Richter, and Christof Schütte. 2024. “Improving Control Based Importance Sampling Strategies for Metastable Diffusions via Adapted Metadynamics.” SIAM Journal on Scientific Computing 46 (2): S298–323. https://doi.org/10.1137/22M1503464.